ABSTRACT

Learning hashing functions for cross-media search is highly
desirable because of their low storage cost and fast query speed.
However, data crawled from the Internet cannot always guarantee
good correspondence among different modalities, which affects
the learning of the hashing functions. In this paper, we focus
on cross-modal hashing with partially corresponded data. The
data without full correspondence are used to enhance the
hashing performance. Experiments on the Wiki and NUS-WIDE
datasets demonstrate that the proposed method outperforms some
state-of-the-art hashing approaches while using less
correspondence information.

Index Terms: Cross-modality, Hashing, Partial Correspondence,
Multimedia Search

1. INTRODUCTION 

Hashing techniques are increasingly popular for scalable
similarity search in many applications [?]. The basic task of
hashing is to map high-dimensional data into compact binary
codes, which is very effective in Approximate Nearest Neighbour
(ANN) search. A good hashing function should give data objects
that are similar in the original feature space the same or
similar hash codes.

Many hashing approaches have been proposed in recent years.
One of the most successful schemes is Locality-Sensitive Hashing
(LSH) [?], which uses random projections to obtain the hashing
functions. Inspired by manifold theory, Spectral Hashing [?] and
its extensions attempt to capture the local manifold properties
to learn hashing functions. However, these methods are designed
for unimodal tasks and cannot capture the correlations between
multiple modalities.

In modern multimedia retrieval, many applications involve data
objects that consist of different modalities. For example, the
Wikipedia website provides both text and image descriptions for
each entry. Learning the latent correlation between multiple
modalities is valuable for cross-media retrieval. The tasks of
image-to-image, image-to-text and text-to-image query can be
combined into a unified framework. Many works focus on
cross-modal hashing for fast similarity search [?]. Bronstein
et al. [?] first proposed the multimodal hashing problem (MMSH),
which learns the hashing functions between the relevant
modalities. Kumar et al. [?] extended spectral hashing to the
multimodal scenario. Zhen et al. [?] proposed the CRH model,
which is learned by boosting algorithms. Zhen et al. [?]
directly learn the binary codes with latent variable models.

Although the multimodal hashing methods mentioned above have
achieved good performance on many real datasets, they all
require good correspondence between multiple modalities. More
specifically, the image and text should be provided in pairs.
However, a good match between the text description and the
image cannot always be guaranteed for data crawled from the
Internet. The image labels provided by users are usually noisy.
The images in webpages might be unavailable due to transfer
problems. It is very expensive to ensure that each image-text
pair is in correspondence. In this paper, we focus on how to
make use of the large amount of images and structured texts
without full correspondence across modalities to enhance
hashing for cross-modal search. We propose Partial
Correspondence Cross-Modal Hashing (PCCMH) to solve this
problem, as shown in Fig. 1. In each modality, an anchor graph
[?] is built to efficiently capture the local manifold and
preserve the local smoothness. For well-corresponded data, we
map the objects into Hamming spaces in which the modalities are
transformed into the same or similar binary codes. The hashing
function is learned by maximizing the local smoothness in each
modality and the cross-modal correspondence.

The remainder of this paper is organized as follows. The
details of PCCMH are presented in Section 2. In Section 3, we
conduct extensive experiments to evaluate the performance of
PCCMH in comparison with some state-of-the-art methods. The
conclusion and a discussion of future work are presented in
Section 4.


2. METHODOLOGY 


In this section, we describe the details of the proposed PCCMH
method. We first define the hashing problem with partially
corresponded modalities in Section 2.1. Local smoothness
preservation and hashing with corresponded modalities are
presented in Sections 2.2 and 2.3. Finally, we derive the
optimization problem of PCCMH in Section 2.4.

* In this paper, "cross-modal" and "multimodal" are used interchangeably.

Fig. 1. Main framework of PCCMH. In each modality, an anchor graph is built to preserve the local smoothness. For well-corresponded data, we map the objects into Hamming spaces. The hashing function is learned by maximizing both the local smoothness and the cross-modal correspondence.

2.1. Notations and Problem Definition 

For ease of presentation, assume we have two modalities
$\mathcal{M}_x \subset \mathbb{R}^{d_x}$ and $\mathcal{M}_y \subset \mathbb{R}^{d_y}$, where $d_x$ and $d_y$ are the
dimensions of the feature space in each modality. It is easy to
extend our model to the hashing problem with more than two
modalities. Let $X$ and $Y$ denote the training entities in the
two modalities, where
$X = \{x_1, x_2, \ldots, x_n, x_{n+1}, \ldots, x_{n_x}\}$, $x_i \in \mathcal{M}_x$, and
$Y = \{y_1, y_2, \ldots, y_n, y_{n+1}, \ldots, y_{n_y}\}$, $y_i \in \mathcal{M}_y$.
A fraction of the data in $X$ and $Y$ are corresponded pairs
indicating the same objects, represented by
$\{x_i, y_i\}, i = 1, 2, \ldots, n$, while the rest are not. The
task of PCCMH is to learn two hashing functions for the two
modalities: $f(x): \mathbb{R}^{d_x} \to \{-1, 1\}^c$ and
$g(y): \mathbb{R}^{d_y} \to \{-1, 1\}^c$, where $c$ is the length
of the binary hash code^1. These two functions map the feature
vectors in each modality into the same Hamming space. The goal
is to give data objects that are similar across modalities the
same or similar hash codes.
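To make the shared Hamming space concrete, the following minimal sketch applies two linear sign hash functions and compares the resulting codes by Hamming distance. It is an illustration only: the projection matrices `Bx` and `By` are random placeholders rather than learned PCCMH projections, and the dimensions and helper names are assumptions chosen for the example.

```python
import numpy as np

def hash_codes(features, projection):
    """Map row-vector features to {-1, +1}^c codes with a linear sign hash."""
    codes = np.sign(features @ projection)
    codes[codes == 0] = 1  # break ties so every entry is -1 or +1
    return codes.astype(np.int8)

def hamming_distance(code_a, code_b):
    """Number of differing bits between two {-1, +1} codes."""
    return int(np.sum(code_a != code_b))

rng = np.random.default_rng(0)
d_x, d_y, c = 128, 10, 16            # illustrative feature dimensions and code length
Bx = rng.standard_normal((d_x, c))   # placeholder for a learned projection (assumption)
By = rng.standard_normal((d_y, c))   # placeholder for a learned projection (assumption)

x = rng.standard_normal((1, d_x))    # one feature vector from modality X
y = rng.standard_normal((1, d_y))    # one feature vector from modality Y
code_x = hash_codes(x, Bx)
code_y = hash_codes(y, By)
dist = hamming_distance(code_x, code_y)
```

Because both functions output codes of the same length $c$, a query from one modality can be ranked against a database from the other by Hamming distance alone.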

2.2. Local Smoothness Preserving with Non-corresponded 
Modality 

As shown in Fig. 1, we capture the local smoothness as Spectral
Hashing [?] does, which can effectively make use of both the
partially and the fully corresponded data. Data that are similar
in the original feature space should be similar in Hamming space
after hashing. The first step is to build the similarity matrix
$W$ in each modality. However, this procedure takes $O(n^2)$
time, where $n$ is the number of data points, which is
unacceptable for large-scale datasets. Therefore, we adopt the
Anchor Graph [13] to exploit the similarity of the data
efficiently with $O(n)$ complexity.

^1 In order to obtain a feasible solution, we use $\{-1, 1\}$ instead of $\{0, 1\}$ as the binary code.

Taking modality $X$ as an example, the anchor graph uses $m_x$
($m_x \ll n_x$) landmarks (i.e., anchors) in the feature space
of $X$. The anchors can be efficiently obtained by a scalable
K-means algorithm [?] with $O(n_x)$ complexity. The similarity
between the training data and the anchors can be computed as:

$(Z_x)_{ij} = \ldots, \quad i = 1, \ldots, n_x, \; j = 1, \ldots, m_x$ (1)

Therefore, each data point is transformed into a new vector
whose elements are the similarities between the data point and
the anchors.

With the support of the anchor graph, the similarity matrix
$W_x$ can be approximately obtained as follows:

$W_x = Z_x \Lambda^{-1} Z_x^T, \quad \Lambda = \mathrm{diag}(Z_x^T \mathbf{1})$ (2)
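The construction of Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a common anchor-graph construction in which each point is connected to its `s` nearest anchors with Gaussian weights and each row of `Z` is normalized to sum to one. The function names, the bandwidth `sigma`, the value of `s`, and the use of randomly sampled points in place of K-means centers are all hypothetical choices.

```python
import numpy as np

def anchor_similarity(X, anchors, s=3, sigma=1.0):
    """Return Z (n x m): Gaussian similarity to the s nearest anchors, rows sum to 1."""
    # Squared Euclidean distances between every point and every anchor.
    sq_dist = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    n, m = sq_dist.shape
    Z = np.zeros((n, m))
    nearest = np.argsort(sq_dist, axis=1)[:, :s]   # indices of the s closest anchors
    rows = np.arange(n)[:, None]
    picked = sq_dist[rows, nearest]
    # Subtract the row minimum before exponentiating for numerical stability;
    # the normalized weights are unchanged by this shift.
    weights = np.exp(-(picked - picked.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    Z[rows, nearest] = weights / weights.sum(axis=1, keepdims=True)
    return Z

def anchor_graph_affinity(Z):
    """Low-rank affinity W = Z diag(Z^T 1)^{-1} Z^T, following Eq. (2)."""
    col_sums = Z.sum(axis=0)
    col_sums[col_sums == 0] = 1.0   # guard against anchors that no point selected
    return (Z / col_sums) @ Z.T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
anchors = X[rng.choice(50, size=5, replace=False)]  # stand-in for K-means centers
Z = anchor_similarity(X, anchors)
W = anchor_graph_affinity(Z)
```

The dense `W` is formed here only to inspect a toy example; for large datasets one would keep `Z` and never materialize the $n \times n$ matrix.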

In order to preserve the local smoothness in hashing, we
minimize the following objective:

$\min \sum_{i,j=1}^{n_x} W_x(i,j) \, \| f(x_i) - f(x_j) \|^2$ (3)

Although many kinds of functions can be used to define $f(x)$,
we adopt the commonly used linear hashing scheme:
$f(x) = \mathrm{sgn}(x B_x)$, where $\mathrm{sgn}(\cdot)$ denotes
the element-wise sign function and $B_x$ is a linear projection
matrix to be learned. With the help of the Anchor Graph, we have
$f(X) = \mathrm{sgn}(X B_x) = \mathrm{sgn}(Z_x B_x)$,
$X = (x_1, \ldots, x_{n_x})$. With this representation, the
local smoothness in Eq. (3) can be rewritten in matrix form as
follows:

$\min \; \mathrm{tr}\left( \mathrm{sgn}(Z_x B_x)^T L_x \, \mathrm{sgn}(Z_x B_x) \right)$ (4)

where $L_x = D_x - W_x$ and $D_x$ is the diagonal matrix whose
elements are the row sums of $W_x$. It is NP-hard to directly
solve the hashing problem in Eq. (4). We apply the relaxed form
[?] to solve it:

$\min \; \mathrm{tr}\left( (Z_x B_x)^T L_x (Z_x B_x) \right)$
$\text{s.t.} \;\; (B_x)^T (B_x) = n_x I_c$ (5)

where $I_c$ denotes an identity matrix of size $c \times c$.

Since $W_x$ can be approximated with $Z_x$, Eq. (5) can be
written as:

$\min \; \mathrm{tr}\left( (B_x)^T \tilde{L}_x (B_x) \right)$
$\text{s.t.} \;\; (B_x)^T (B_x) = n_x I_c$ (6)

where $\tilde{L}_x = (Z_x)^T (Z_x) - (Z_x)^T (Z_x) \Lambda^{-1} (Z_x)^T (Z_x)$ is the
reduced graph Laplacian. In order to maximize the local
smoothness in both modalities, we have:

$\min \; \mathrm{tr}\left( (B_x)^T \tilde{L}_x (B_x) \right) + \mathrm{tr}\left( (B_y)^T \tilde{L}_y (B_y) \right)$
$\text{s.t.} \;\; (B_x)^T (B_x) = n_x I_c$
$\qquad (B_y)^T (B_y) = n_y I_c$ (7)

where $\tilde{L}_y$ is the reduced graph Laplacian obtained by
the Anchor Graph in modality $Y$.
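The reduced graph Laplacian can be computed from the anchor representation alone, without ever forming the $n \times n$ similarity matrix. The sketch below is illustrative: the helper name `reduced_laplacian` is hypothetical, and the toy check assumes that each row of `Z` sums to one, so that the degree matrix of the anchor-graph affinity is the identity.

```python
import numpy as np

def reduced_laplacian(Z):
    """Compute Z^T Z - Z^T Z diag(Z^T 1)^{-1} Z^T Z from the n x m matrix Z."""
    gram = Z.T @ Z                  # m x m Gram matrix
    col_sums = Z.sum(axis=0)        # diagonal entries of Lambda
    return gram - (gram / col_sums) @ gram

# Toy check against the explicit n x n Laplacian L = D - W.
rng = np.random.default_rng(1)
Z = rng.random((40, 6))
Z /= Z.sum(axis=1, keepdims=True)   # rows sum to one, so D = I (assumption)
W = (Z / Z.sum(axis=0)) @ Z.T       # W = Z Lambda^{-1} Z^T
L_full = np.diag(W.sum(axis=1)) - W
L_small = reduced_laplacian(Z)
```

Under the row-normalization assumption, `L_small` equals `Z.T @ L_full @ Z`, while its cost scales with $n m^2$ rather than $n^2$.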

2.3. Hashing with Corresponded Modalities 

For the training data in good correspondence (i.e., data pairs
$\{x_i, y_i\}, i = 1, 2, \ldots, n$ are provided), we first
transform them with the Anchor Graph into similarity matrices
$\hat{Z}_x \in \mathbb{R}^{n \times m_x}$ and
$\hat{Z}_y \in \mathbb{R}^{n \times m_y}$, where $m_x$ and $m_y$
are the numbers of anchors in the two modalities respectively.
For the data pair $\{x_i, y_i\}$, the hash codes should be
similar in Hamming space. The cross-modal correspondence can be
maximized via the following optimization problem:

$\min \; \| \mathrm{sgn}(\hat{Z}_x B_x) - \mathrm{sgn}(\hat{Z}_y B_y) \|_F^2$ (8)

Using the relaxed form mentioned in Section 2.2, the
optimization problem is rewritten as follows:

$\min \; \| \hat{Z}_x B_x - \hat{Z}_y B_y \|_F^2$
$\text{s.t.} \;\; (B_x)^T (B_x) = n_x I_c$
$\qquad (B_y)^T (B_y) = n_y I_c$ (9)

The optimization problem above minimizes the difference between
the two representations of an object in different modalities.


2.4. Final Optimization Problem 

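Combining the optimization problems in Eqs. (7) and (9), we have
the final problem as follows:

$\min \; \| \hat{Z}_x B_x - \hat{Z}_y B_y \|_F^2 + \lambda \left( \mathrm{tr}((B_x)^T \tilde{L}_x (B_x)) + \mathrm{tr}((B_y)^T \tilde{L}_y (B_y)) \right)$
$\text{s.t.} \;\; (B_x)^T (B_x) = n_x I_c$
$\qquad (B_y)^T (B_y) = n_y I_c$ (10)

where $\lambda$ is the balance coefficient between local
smoothness and cross-modality hashing. We can transform the
cross-modality term as follows:

$\| \hat{Z}_x B_x - \hat{Z}_y B_y \|_F^2 = \mathrm{tr}\left( (\hat{Z}_x B_x - \hat{Z}_y B_y)^T (\hat{Z}_x B_x - \hat{Z}_y B_y) \right)$ (11)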

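Therefore, the final objective function is:

$\| \hat{Z}_x B_x - \hat{Z}_y B_y \|_F^2 + \lambda \left( \mathrm{tr}((B_x)^T \tilde{L}_x (B_x)) + \mathrm{tr}((B_y)^T \tilde{L}_y (B_y)) \right)$
$= \mathrm{tr}\left( (\hat{Z}_x B_x - \hat{Z}_y B_y)^T (\hat{Z}_x B_x - \hat{Z}_y B_y) \right) + \lambda \left( \mathrm{tr}((B_x)^T \tilde{L}_x (B_x)) + \mathrm{tr}((B_y)^T \tilde{L}_y (B_y)) \right)$
$= \mathrm{tr}(B^T Z B + \lambda B^T L B)$ (12)

where
$B = \begin{bmatrix} B_x \\ B_y \end{bmatrix}$,
$Z = \begin{bmatrix} \hat{Z}_x^T \hat{Z}_x & -\hat{Z}_x^T \hat{Z}_y \\ -\hat{Z}_y^T \hat{Z}_x & \hat{Z}_y^T \hat{Z}_y \end{bmatrix}$ and
$L = \begin{bmatrix} \tilde{L}_x & 0 \\ 0 & \tilde{L}_y \end{bmatrix}$.

The final optimization problem in Eq. (10) becomes:

$\min \; \mathrm{tr}\left( B^T (Z + \lambda L) B \right)$
$\text{s.t.} \;\; B^T B = I$ (13)

Eq. (13) is an eigenvalue problem. The optimal $B$ consists of
the eigenvectors corresponding to the $2c$ smallest eigenvalues
of $Z + \lambda L$. We can solve it to obtain the hashing
functions $f(x)$ and $g(y)$.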
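The eigenvalue problem of Eq. (13) can be sketched with a standard symmetric eigensolver. This is a minimal illustration under assumptions, not the authors' code: the function name `solve_pccmh` and its arguments are hypothetical, the number of retained eigenvectors is exposed as a parameter `num_vectors`, and identity matrices stand in for the reduced Laplacians in the toy usage.

```python
import numpy as np

def solve_pccmh(Zx_pair, Zy_pair, Lx_red, Ly_red, lam, num_vectors):
    """Minimise tr(B^T (Z + lam * L) B) subject to B^T B = I by eigendecomposition."""
    mx = Zx_pair.shape[1]
    # Block matrix obtained by expanding ||Zx Bx - Zy By||_F^2 with B = [Bx; By].
    Z_blk = np.block([
        [Zx_pair.T @ Zx_pair, -Zx_pair.T @ Zy_pair],
        [-Zy_pair.T @ Zx_pair, Zy_pair.T @ Zy_pair],
    ])
    zeros = np.zeros((mx, Ly_red.shape[0]))
    L_blk = np.block([[Lx_red, zeros], [zeros.T, Ly_red]])
    M = Z_blk + lam * L_blk
    M = (M + M.T) / 2.0                    # symmetrise against round-off
    _, eigvecs = np.linalg.eigh(M)         # eigenvalues come back in ascending order
    B = eigvecs[:, :num_vectors]           # eigenvectors of the smallest eigenvalues
    return B[:mx], B[mx:]                  # split into Bx and By

rng = np.random.default_rng(2)
n, mx, my, c = 30, 6, 5, 4                 # toy sizes (assumptions)
Zx_pair = rng.random((n, mx))              # anchor features of the paired samples
Zy_pair = rng.random((n, my))
Lx_red = np.eye(mx)                        # placeholder reduced Laplacians
Ly_red = np.eye(my)
Bx, By = solve_pccmh(Zx_pair, Zy_pair, Lx_red, Ly_red, lam=0.6, num_vectors=c)
codes_x = np.sign(Zx_pair @ Bx)            # relaxed solution binarised by sgn
codes_y = np.sign(Zy_pair @ By)
```

Stacking the two projections into one matrix lets a single decomposition of an $(m_x + m_y) \times (m_x + m_y)$ matrix produce both hashing functions, so the cost depends on the number of anchors rather than the number of training points.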


3. EXPERIMENTS 

In this section, we evaluate the performance of PCCMH by
conducting extensive experiments and comparing it with some
state-of-the-art methods. All experiments are conducted on a
workstation with a Xeon(R) W3503 CPU @ 2.40GHz and 6GB of RAM.

3.1. Datasets 

We conduct the experiments on two widely-used datasets: 

• Wiki: The Wiki dataset is generated from 2866 Wikipedia
documents [?]. Each document is a pair of an image and text
keywords. The images are represented by 128D SIFT BoW
features, while the text keywords are represented by 10D
features whose elements are topics extracted by Latent
Dirichlet Allocation.

• NUS-WIDE: The NUS-WIDE dataset is a large-scale web image
dataset crawled from Flickr [?]. It contains 269,648 images
associated with 81 semantic concepts. Each image is
represented by 500D SIFT BoW features and 1000D textual
features obtained by performing PCA on the original tag
occurrence features.

Fig. 2. MAP results of PCCMH on the Wiki dataset with different amounts of correspondence information.

3.2. Baselines and Evaluation Scheme 

In this paper, we evaluate the proposed method on two typical
cross-modal retrieval tasks: querying the image database with
text words (text-to-image) and querying the text database with
images (image-to-text). Some state-of-the-art cross-modal
hashing schemes are used for comparison, including MMSH [?],
CCA [?], CRH [?] and MLBE [?].

Retrieval performance is evaluated with Mean Average Precision
(MAP). For a query $q$, the average precision is defined as:

$AP(q) = \frac{1}{L_q} \sum_{r=1}^{R} P_q(r) \, \delta_q(r)$ (14)

where $L_q$ is the number of ground-truth neighbours in the
retrieved list, $P_q(r)$ is the precision of the top $r$
retrieved results, and $\delta_q(r) = 1$ if the $r$-th result is
a true neighbour and 0 otherwise. $R$ is set to 50 in this paper.
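The average precision of Eq. (14) and the resulting MAP can be computed as in the following sketch. The function names and the convention of returning 0 when no true neighbour appears in the top $R$ results are assumptions made for this illustration.

```python
import numpy as np

def average_precision(relevance, R=50):
    """AP over the top-R results; relevance[r-1] is 1 if the r-th result is a true neighbour."""
    rel = np.asarray(relevance[:R], dtype=float)
    L_q = rel.sum()                          # true neighbours in the retrieved list
    if L_q == 0:
        return 0.0                           # assumed convention: no hits gives AP = 0
    ranks = np.arange(1, rel.size + 1)
    precision_at_r = np.cumsum(rel) / ranks  # P_q(r) for r = 1..R
    return float((precision_at_r * rel).sum() / L_q)

def mean_average_precision(relevance_lists, R=50):
    """MAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(rel, R) for rel in relevance_lists]))

example = [1, 0, 1, 0, 0]                    # hits at ranks 1 and 3
ap = average_precision(example, R=5)         # (1/1 + 2/3) / 2 = 5/6
```

In the example, the precisions at the two hits are 1 and 2/3, so the average precision is 5/6.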

3.3. Accuracy on Wiki Dataset 

For the Wiki dataset, we conduct two experiments for
evaluation: (1) We randomly select 60% of the training data for
which the correspondence between the text and image modalities
is available. For the rest, the correspondence is not provided
for training. (2) We vary the amount of corresponded data from
20% to 80% of the whole dataset to evaluate the performance of
PCCMH. For the other baselines, the full correspondence is
given. The number of landmarks ($m_x$ and $m_y$) in the anchor
graph is 200. The balance coefficient in Eq. (13) is 0.6. Each
experiment is repeated 10 times. As shown in Table 1, PCCMH
achieves the best or close to the best performance in
comparison with the baselines when 60% of the correspondence
information is available.

The performance of PCCMH with varying correspondence
information is presented in Fig. 2. When the ratio of
corresponded data varies from 20% to 80%, the performance of
PCCMH increases in both text-to-image and image-to-text
retrieval. However, the improvement is not obvious when the
ratio is larger than 50%. Since the number of classes in the
Wiki dataset is limited, 50% of the dataset can cover most of
the diversity. The preservation of local smoothness provides
good generalization of the hashing model.


Table 1. MAP Results on the Wiki Dataset

Task                          Method    Code Length (bits)
                                        c = 16    c = 24    c = 32
Image Query vs. Text Dataset  PCCMH     0.1753    0.1774    0.1657
                              CCA       0.1658    0.1532    0.1558
                              CRH       0.1370    0.1605    0.1398
                              MLBE      0.1573    0.1751    0.1793
                              MMSH      0.1684    0.1617    0.1624
Text Query vs. Image Dataset  PCCMH     0.1839    0.2010    0.1904
                              CCA       0.1658    0.1532    0.1558
                              CRH       0.1341    0.1605    0.1398
                              MLBE      0.1827    0.1624    0.2107
                              MMSH      0.1707    0.1824    0.1724


3.4. Accuracy on NUS-WIDE Dataset 

We randomly select 5,000 samples from the NUS-WIDE dataset, of
which 4,000 are used for training and the rest for testing. For
PCCMH, 60% of the correspondence information is available. The
parameters of PCCMH for NUS-WIDE are the same as for the Wiki
dataset. As shown in Table 2, PCCMH still outperforms the
baselines in most cases with 60% of the correspondence
information.


Table 2. MAP Results on the NUS-WIDE Dataset

Task                          Method    Code Length (bits)
                                        c = 16    c = 24    c = 32
Image Query vs. Text Dataset  PCCMH     0.2334    0.2281    0.2172
                              CCA       0.2214    0.2212    0.2264
                              CRH       0.2123    0.2010    0.1865
                              MLBE      0.1840    0.2030    0.2222
                              MMSH      0.1920    0.2245    0.2364
Text Query vs. Image Dataset  PCCMH     0.2261    0.2311    0.2349
                              CCA       0.2208    0.2211    0.2285
                              CRH       0.2233    0.2305    0.2260
                              MLBE      0.2188    0.2064    0.2226
                              MMSH      0.1920    0.2425    0.2164

4. CONCLUSION 

Most existing cross-modal hashing methods need fully
corresponded data to learn the hashing function. In this paper,
we propose PCCMH, which uses partially corresponded information
for cross-modal hashing. Experiments on the Wiki and NUS-WIDE
datasets demonstrate the feasibility and good performance of
PCCMH.